Recent upward trends in natural disasters have caused incalculable loss of property, economic well-being, and life to millions all across the globe, especially in poor and underdeveloped nations. From fires in California and tornadoes in Kentucky, to flash floods in India and Nepal and a devastating earthquake in Haiti, natural disasters have become one of the greatest threats to a peaceful existence on earth\(^{[4]}\). It is natural to question the role of climate change and its precursors, such as rising temperatures and emissions, in the preponderance of such destructive events. An understanding of the causes of the recent influx of natural disasters, as well as potential solutions, necessitates an investigation into recent trends in climate data.
The motivation for this report is to provide a better understanding of the link between drivers and correlating factors of climate change and natural disaster patterns. We will observe data from 1900 to the present in order to detect trends and predict future outcomes. Descriptive and inferential analyses will be performed on various types and quantities of natural disasters over the factors of time, global average land and ocean temperatures, and \(CO_2\) and methane emissions. In performing these analyses we hope to discover any correlation between natural disasters and high versus low temperatures and emission levels, and to establish a trend that indicates what to expect in the future. The scope of this paper is to assess factors that cause, or are correlated with, the characteristics of natural disasters. Future studies should make use of the information here to establish the best solutions to the problem of increasing natural disaster occurrences.
This report utilizes three datasets. The first contains information about natural disasters occurring globally from 1900 to 2021. We utilize the following variables:
The second dataset contains land and ocean average temperatures from 1899 to 2015. We utilize the following variables:
We average the temperature data over 12-month periods to get data at yearly granularity to match our other datasets.
The third dataset contains emissions data for each year from 1750 to 2020. We utilize the following variables:
Various statistical methods were used to present our data in an informative way, as well as perform analysis on historical data and make future predictions.
Graphical representations of our data, such as histograms, scatterplots, and curvilinear regression plots, are used throughout the report. We used linear regression modeling to fit a relationship between average land temperatures and the quantity of natural disasters. Time series analyses were performed to observe trends in temperature over time and make predictions about the future. We performed ANOVA to determine whether different types of natural disasters occurred at different temperatures or different levels of emissions. Finally, we performed a QDA analysis to see if we could accurately predict which types of disasters would be most likely to occur under which temperature/emissions conditions.
All of the coding for the project was done in R using RStudio. Plots were made using Base R, ggplot2, and plotly. We used several packages including dplyr for data manipulations, forecast for time series analysis, and MASS and caret for supervised classification modeling.
We see that the yearly number of global natural disasters increases roughly exponentially from 1910 to 2010 before dipping slightly in the 2010s. The count appears low for the 2020s, but we are only two years into this period and the data captured run only through 2021.
Looking at the data more locally, we see that California fires have mostly been on an upward trend, but not at an exponential rate like total global natural disasters. It is notable that there are no data on fires in California from before 1970 or from 1975 to 1990. Since this is obviously not the case in actuality, we can postulate that the data on California fires are incomplete.
This graphic includes a scatterplot of the global land average temperatures for each year as well as a regression line drawn through them, with the shaded region representing a 95% confidence interval. There is a clear upward trend in average temperature over time since 1900, which probably comes with little surprise. One interesting thing to point out is the massive outlier around the year 1900. It is worth investigating whether this is due to some anomalous event or simply a data entry error.
This plot shows the trend between total number of disasters and average land temperature. As in the previous plot, the points represent actual data points whereas the curve is a fitted regression model. It appears that the relationship between temperature and number of disasters follows a logistic trend defined by the following function:
\[\begin{equation} f(x) = \frac{L}{1 + e^{-k(x-x_0)}} \end{equation}\]
where
L is the curve’s maximum value
\(x_0\) is the x value of the function’s midpoint
k is the logistic growth rate\(^{[5]}\)
In essence, the number of disasters is roughly constantly low for temperatures on the lower end, increases roughly linearly for temperatures in the middle, and is roughly constantly high for temperatures on the higher end.
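A minimal sketch of fitting such a logistic curve in R is shown below. This is not the report's own code: the vectors x and y are hypothetical stand-ins for temperature and disaster counts, and the fit uses the base-R self-starting model SSlogis() with nls().

```r
# Illustrative sketch only: fitting a logistic curve with nls() and the
# self-starting model SSlogis(). x and y are hypothetical stand-ins for
# temperature and yearly disaster counts.
set.seed(1)
x <- seq(7.5, 10.5, length.out = 100)
y <- 900 / (1 + exp(-4 * (x - 9))) + rnorm(100, sd = 30)

fit <- nls(y ~ SSlogis(x, Asym, xmid, scal))
coef(fit)  # Asym estimates L, xmid estimates x0, and 1/scal estimates k
```

Note that SSlogis() parameterizes the curve as Asym / (1 + exp((xmid - x)/scal)), so the logistic growth rate k corresponds to 1/scal.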
We attempt to model the data using two methods. The first plot above shows a fit using the sigmoid-generating function in R. It does not produce the S-shaped curve that we are looking for; nonetheless, it is not a bad fit for our data.
The second plot utilizes splines, which are piecewise cubic polynomials fit locally. We experimented with the nknots parameter of the R function smooth.spline and determined that a value of 6 produced the best fit. While the first plot was not bad, the spline method generated a model that much more closely matches the S-shaped curve we were looking for.
The parameter estimates for the generated spline model are as follows:
Smoothing Parameter s=0.2129022 (\(\lambda\)=0.0016508)
Cubic Coefficients: -2.730562, 6.3536438, 24.3408847, 45.6597687, 121.6594541, 780.9746029, 848.1691568, 891.1093324
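The spline fit itself can be sketched as follows. The vectors temp and n_disasters below are hypothetical stand-ins for the real temperature and disaster-count data, not the report's own inputs.

```r
# Illustrative sketch: smoothing-spline fit with nknots = 6, as described
# above. temp and n_disasters are hypothetical stand-ins for the real data.
set.seed(2)
temp <- seq(8, 10, length.out = 80)
n_disasters <- 450 / (1 + exp(-5 * (temp - 9))) + rnorm(80, sd = 20)

sp <- smooth.spline(temp, n_disasters, nknots = 6)
sp$lambda             # smoothing parameter, reported as lambda above
predict(sp, x = 9)$y  # fitted number of disasters at a 9 degree average
```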
We obtain the ACF and PACF plots for the raw data and two different methods of time series generation. ACF and PACF plots can be used to determine the Moving Average (MA) and Auto-Regressive (AR) components, respectively, of a time series model. The ACF and PACF plots for the original data, the “rough” part of the data, and the first difference of the data (i.e. \(Y_t = X_t - X_{t-1}\), where \(X_t\) is the original data series) are displayed above. The first ACF/PACF plots, for the original series, show no discernible MA component and a possible AR component of order 2.
However, we can probably improve upon this model by analyzing three different methods of handling time series data. The first method involves subtracting off the “trend” part of the data and analyzing the remaining “rough” part. From the ACF plot, it can be determined that the rough part has an MA component of order 3. The PACF plot gives no relevant information.
The second method involves taking the first difference of the data and then analyzing \(Y_t\). In this case, the PACF plot shows that this first difference series probably has an AR component of order 3 while the ACF plot gives no relevant information.
Finally, for the third method we utilize the R function auto.arima with parameter seasonal set to TRUE. This will automatically generate the “best” model according to the algorithm used in auto.arima, and will check for any seasonal components. Since we don’t need to calculate the order of the AR and MA components, as this is handled by R, we do not need the ACF and PACF plots.
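The auto.arima workflow can be sketched as follows. The series temps_yearly is a hypothetical yearly temperature series, not the report's data, and serves only to show the shape of the calls.

```r
library(forecast)

# Illustrative sketch: automatic ARIMA order selection and a ten-year
# forecast. temps_yearly is a hypothetical yearly temperature series.
set.seed(3)
temps_yearly <- ts(8.5 + 0.01 * (1:100) + rnorm(100, sd = 0.1),
                   start = 1900, frequency = 1)

fit <- auto.arima(temps_yearly, seasonal = TRUE)
fc  <- forecast(fit, h = 10)  # forecasts for the next ten years
fit$aic                       # AIC, usable for model comparison
```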
Here we display the results of fitting the three previously described models against the historical data in the left column, and the forecasted data obtained from each model in the right column.
The auto-generated seasonal model produces a smoother plot than the rough plus trend and first difference models. This is not necessarily a good thing, however, because we would like to keep the individual fluctuations present in the original data when making new predictions. Looking at AIC values may help us decide which model is best for forecasting. The AIC values for the rough plus trend, first difference, and seasonal models are -115.8025208, -58.0940606, and -61.4049308, respectively. We select the model with the lowest AIC, in this case the rough plus trend model.
The ten-year forecasts for the three models look significantly different. Rough plus trend shows variations only up to the third year and then simply follows a straight upward-sloping line corresponding to the trend. This is because MA models of order q only provide random variation up to the \(q^{th}\) forecasted year. The first difference forecast shows some random variation through all ten forecasted years while also following the trend, which makes it promising for prediction. The seasonal model forecast is a constantly upward-sloping line that follows the trend but provides no random variation.
Based on the AIC criterion and the forecast plots, it seems that we can rule out the seasonal model. The first difference model appears to be best if we need a model that provides accurate yearly fluctuations for more than three years, whereas the rough plus trend model seems to be the most likely to give the most accurate average prediction.
We plot the fitted historical data for the past ten years and the forecasted data for the next ten years using the rough plus trend and first difference models above. The red line indicates the point at which we go from fitting historical data to forecasting future years. By zooming in on the data, we can see that both the fitted historical values and the forecasted values differ for the two models.
As we can see from the plots above, \(CO_2\) emissions and land and land-and-ocean average temperatures have risen since 1990. The overall trend is the same for methane, but with a noticeable dip between 1990 and around 1996.
The data that we have available for \(\textbf{all}\) of these variables covers a shorter time period than for temperature alone (a smaller sample size), so we must handle the data carefully and make sure no assumptions are violated in our analyses.
We create plots of all three predictor variables against one another in order to determine if there is multicollinearity. Strong multicollinearity appears to exist between land and land-and-ocean average temperatures and between \(CO_2\) and methane. Because of this, we should only keep one variable from each group; we choose to keep \(CO_2\) and land average temperature.
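This check can be sketched with a correlation matrix alongside the pairwise plots. The frame climate_df below is hypothetical data constructed to mimic the collinearity we observed; it is not the report's data.

```r
# Illustrative sketch: flagging multicollinearity among candidate predictors.
# climate_df is hypothetical data built to mimic the observed collinearity.
set.seed(4)
n <- 200
land <- rnorm(n, mean = 9, sd = 0.5)
co2  <- 28000 + 500 * land + rnorm(n, sd = 100)
climate_df <- data.frame(
  land_temp       = land,
  land_ocean_temp = 0.7 * land + rnorm(n, mean = 4, sd = 0.05),
  co2             = co2,
  methane         = 0.01 * co2 + rnorm(n, sd = 1)
)

round(cor(climate_df), 2)  # pairwise correlations near 1 flag collinearity
pairs(climate_df)          # scatterplot matrix of all predictor pairs
```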
The two plots above are a representation of average land temperature and \(CO_2\) emissions data points for each natural disaster type. Simply looking at these plots does not reveal a clear way to categorize natural disasters by temperature and \(CO_2\) emissions, so we will require more analytical methods to separate them.
One thing to notice from the color-coded temperature plot is that, although temperature is usually directly proportional to emission levels, some higher-temperature points occur near the middle of the emissions spectrum. Therefore, emissions cannot be the only driver of increases in global temperature.
Here we separate out the factors of \(CO_2\) emissions and average land temperature for each disaster type into two boxplots. For the first plot, it appears most disaster types have roughly similar median temperatures, though there are some deviations. We also observe a few outliers for the disaster types “Extreme temperature” (somewhat obviously), “Flood”, and “Insect infestation”.
For the second plot, the median \(CO_2\) levels do seem a bit more spread out than for temperature. It is important to mention, however, that while differences observed between medians among boxplots may indicate a notable difference in the true mean values of the measured variable, they do not necessarily indicate a statistically significant difference.
We removed disaster categories with only 2 observations (Animal accidents and Impacts) and ran Henze-Zirkler tests on each of the remaining disaster categories to test for multivariate normality, as well as Box’s M test for equality of covariance. We find that none of the categories is normally distributed and that the covariance matrices are not equal (at the 0.05 significance level).
Since neither normality nor equal covariance holds, this rules out using Linear Discriminant Analysis (LDA). Quadratic Discriminant Analysis (QDA) is similar to LDA except that it doesn’t require the assumption of equal covariance and can be robust to non-normal data. Another method that could work well for non-normal unequal covariance data would be Support Vector Machines (SVM).
The QDA method involves finding the class, k, which maximizes the following function:
\[\begin{equation} \delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k \end{equation}\]
where
\(\Sigma_k\) is the variance-covariance matrix for class k
\(\mu_k\) is the mean of class k
\(\pi_k\) is the prior probability of class k\(^{[6]}\)
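The discriminant above can be evaluated directly. In the sketch below, the observation, class mean, and prior loosely echo the report's "Flood" estimates, but the covariance matrix \(\Sigma_k\) is entirely hypothetical; this is a worked illustration of the formula, not the fitted model.

```r
# Illustrative sketch: evaluating the QDA discriminant delta_k(x) for one
# class. mu_k and pi_k loosely echo the "Flood" estimates; Sigma_k is
# entirely hypothetical.
qda_discriminant <- function(x, Sigma_k, mu_k, pi_k) {
  as.numeric(
    -0.5 * log(det(Sigma_k)) -
      0.5 * t(x - mu_k) %*% solve(Sigma_k) %*% (x - mu_k) +
      log(pi_k)
  )
}

x       <- c(9.46, 29187)      # observation: (temperature, CO2)
mu_k    <- c(9.46, 29186.72)   # class mean
Sigma_k <- diag(c(0.25, 1e6))  # hypothetical covariance matrix
pi_k    <- 0.36                # class prior probability

qda_discriminant(x, Sigma_k, mu_k, pi_k)  # the class maximizing this wins
```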
Before beginning our classification of disaster type by average land temperature and \(CO_2\) emission level, we filter “Animal accident” and “Impact” out of our dataset because they have very few observations. We then randomly split our roughly 20,000 observations into a training set with 70% of the observations and a test set with the remaining 30%. We fit a QDA model to the training set and obtain the following results:
Call:
qda(Type ~ LandAverageTemperature + co2, data = train)
Prior probabilities of groups:
Drought Earthquake Epidemic
0.042493751 0.075283047 0.116379944
Extreme temperature Flood Insect infestation
0.047787090 0.360608734 0.002793707
Landslide Mass movement (dry) Storm
0.046904867 0.002646670 0.260476401
Volcanic activity Wildfire
0.013968534 0.030657256
Group means:
LandAverageTemperature co2
Drought 9.438884 28693.59
Earthquake 9.397477 27978.79
Epidemic 9.410260 27489.77
Extreme temperature 9.471152 29462.65
Flood 9.460585 29186.72
Insect infestation 9.280246 26417.09
Landslide 9.407602 28294.25
Mass movement (dry) 9.232944 27014.59
Storm 9.403640 28220.35
Volcanic activity 9.401119 27968.85
Wildfire 9.430425 27976.82
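The split-and-fit pipeline can be sketched as follows. The frame disasters_df is a small synthetic stand-in for the real merged data, and the partitioning uses caret's createDataPartition(); only the shape of the calls matches the report's workflow.

```r
library(MASS)
library(caret)

# Illustrative sketch of the 70/30 split and QDA fit. disasters_df is a
# synthetic stand-in for the merged (Type, temperature, CO2) data.
set.seed(5)
n <- 600
disasters_df <- data.frame(
  Type = factor(sample(c("Flood", "Storm", "Earthquake"), n, replace = TRUE)),
  LandAverageTemperature = rnorm(n, mean = 9.4, sd = 0.3),
  co2 = rnorm(n, mean = 28000, sd = 1500)
)

idx   <- createDataPartition(disasters_df$Type, p = 0.7, list = FALSE)
train <- disasters_df[idx, ]
test  <- disasters_df[-idx, ]

qda_fit <- qda(Type ~ LandAverageTemperature + co2, data = train)
preds   <- predict(qda_fit, test)$class
mean(preds == test$Type)  # test-set accuracy
```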
We then use our QDA model to predict the data from our test set and check our predictions against the actual observations. We obtain the following confusion matrix:
Confusion Matrix and Statistics
Reference
Prediction Drought Earthquake Epidemic Extreme temperature Flood
Drought 0 0 0 0 238
Earthquake 0 0 0 0 354
Epidemic 0 0 0 0 642
Extreme temperature 0 0 0 0 244
Flood 0 0 0 0 1926
Insect infestation 0 0 0 0 19
Landslide 0 0 0 0 222
Mass movement (dry) 0 0 0 0 6
Storm 0 0 0 0 1356
Volcanic activity 0 0 0 0 72
Wildfire 0 0 0 0 145
Reference
Prediction Insect infestation Landslide Mass movement (dry) Storm
Drought 0 0 0 26
Earthquake 0 0 0 50
Epidemic 0 0 0 51
Extreme temperature 0 0 0 14
Flood 0 0 0 193
Insect infestation 0 0 0 3
Landslide 0 0 0 40
Mass movement (dry) 0 0 0 4
Storm 0 0 0 183
Volcanic activity 0 0 0 18
Wildfire 0 0 0 24
Reference
Prediction Volcanic activity Wildfire
Drought 0 0
Earthquake 0 0
Epidemic 0 0
Extreme temperature 0 0
Flood 0 0
Insect infestation 0 0
Landslide 0 0
Mass movement (dry) 0 0
Storm 0 0
Volcanic activity 0 0
Wildfire 0 0
Overall Statistics
Accuracy : 0.3617
95% CI : (0.3494, 0.3742)
No Information Rate : 0.8961
P-Value [Acc > NIR] : 1
Kappa : 0.0133
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Drought Class: Earthquake Class: Epidemic
Sensitivity NA NA NA
Specificity 0.95472 0.9307 0.8811
Pos Pred Value NA NA NA
Neg Pred Value NA NA NA
Prevalence 0.00000 0.0000 0.0000
Detection Rate 0.00000 0.0000 0.0000
Detection Prevalence 0.04528 0.0693 0.1189
Balanced Accuracy NA NA NA
Class: Extreme temperature Class: Flood
Sensitivity NA 0.3687
Specificity 0.95575 0.6815
Pos Pred Value NA 0.9089
Neg Pred Value NA 0.1113
Prevalence 0.00000 0.8961
Detection Rate 0.00000 0.3304
Detection Prevalence 0.04425 0.3635
Balanced Accuracy NA 0.5251
Class: Insect infestation Class: Landslide
Sensitivity NA NA
Specificity 0.996226 0.95506
Pos Pred Value NA NA
Neg Pred Value NA NA
Prevalence 0.000000 0.00000
Detection Rate 0.000000 0.00000
Detection Prevalence 0.003774 0.04494
Balanced Accuracy NA NA
Class: Mass movement (dry) Class: Storm
Sensitivity NA 0.30198
Specificity 0.998285 0.74043
Pos Pred Value NA 0.11891
Neg Pred Value NA 0.90142
Prevalence 0.000000 0.10395
Detection Rate 0.000000 0.03139
Detection Prevalence 0.001715 0.26398
Balanced Accuracy NA 0.52120
Class: Volcanic activity Class: Wildfire
Sensitivity NA NA
Specificity 0.98456 0.97101
Pos Pred Value NA NA
Neg Pred Value NA NA
Prevalence 0.00000 0.00000
Detection Rate 0.00000 0.00000
Detection Prevalence 0.01544 0.02899
Balanced Accuracy NA NA
We can see from the confusion matrix that the model predicted every test point to be of type “Flood” or “Storm”, which led to a prediction accuracy of only 36.17%. There could be several reasons for this behavior, including widely different numbers of observations from each class (imbalanced data) or the mean values of the two predictors being too similar among the classes.
For this reason, we chose to replicate the test with a subset of the original observations. We took 500 randomized observations from each of the disaster types “Flood”, “Wildfire”, and “Earthquake”. These were chosen because we hypothesized that they were events that would occur in different climatological settings (i.e. different mean temperature and \(CO_2\) levels) and they had enough observations to obtain a reasonably-sized and equal sample from each.
Repeating the test with this subset yielded the following results:
Confusion Matrix and Statistics
Reference
Prediction Earthquake Flood Wildfire
Earthquake 12 46 93
Flood 13 52 89
Wildfire 7 34 104
Overall Statistics
Accuracy : 0.3733
95% CI : (0.3285, 0.4199)
No Information Rate : 0.6356
P-Value [Acc > NIR] : 1
Kappa : 0.066
Mcnemar's Test P-Value : <2e-16
Statistics by Class:
Class: Earthquake Class: Flood Class: Wildfire
Sensitivity 0.37500 0.3939 0.3636
Specificity 0.66746 0.6792 0.7500
Pos Pred Value 0.07947 0.3377 0.7172
Neg Pred Value 0.93311 0.7297 0.4033
Prevalence 0.07111 0.2933 0.6356
Detection Rate 0.02667 0.1156 0.2311
Detection Prevalence 0.33556 0.3422 0.3222
Balanced Accuracy 0.52123 0.5366 0.5568
We can see that we now get predictions in all three categories, but the prediction accuracy increased by only just over 1%, to 37.33%. This is still not an acceptable prediction accuracy, so other solutions to the classification problem should be attempted.
Since QDA did not give optimal results, we attempt SVM in case the QDA model was not robust to the violation of normality in the data.
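The SVM fit can be sketched with the e1071 package as follows, again using a synthetic stand-in for the real data; the radial kernel shown is e1071's default and an assumption on our part, not a detail confirmed by the report.

```r
library(e1071)

# Illustrative sketch: radial-kernel SVM on a synthetic stand-in for the
# (Type, temperature, CO2) data used in the report.
set.seed(6)
n <- 600
disasters_df <- data.frame(
  Type = factor(sample(c("Flood", "Storm", "Earthquake"), n, replace = TRUE)),
  LandAverageTemperature = rnorm(n, mean = 9.4, sd = 0.3),
  co2 = rnorm(n, mean = 28000, sd = 1500)
)
train <- disasters_df[1:420, ]
test  <- disasters_df[421:600, ]

svm_fit <- svm(Type ~ LandAverageTemperature + co2, data = train,
               kernel = "radial", scale = TRUE)
preds   <- predict(svm_fit, test)
table(Predicted = preds, Actual = test$Type)  # confusion matrix
```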
Confusion Matrix and Statistics
Reference
Prediction Drought Earthquake Epidemic Extreme temperature Flood
Drought 0 0 0 0 220
Earthquake 0 0 0 0 315
Epidemic 0 0 0 0 625
Extreme temperature 0 0 0 0 230
Flood 0 0 0 0 1856
Insect infestation 0 0 0 0 17
Landslide 0 0 0 0 217
Mass movement (dry) 0 0 0 0 5
Storm 0 0 0 0 1235
Volcanic activity 0 0 0 0 67
Wildfire 0 0 0 0 139
Reference
Prediction Insect infestation Landslide Mass movement (dry) Storm
Drought 0 0 0 44
Earthquake 0 0 0 89
Epidemic 0 0 0 68
Extreme temperature 0 0 0 28
Flood 0 0 0 263
Insect infestation 0 0 0 5
Landslide 0 0 0 45
Mass movement (dry) 0 0 0 5
Storm 0 0 0 304
Volcanic activity 0 0 0 23
Wildfire 0 0 0 30
Reference
Prediction Volcanic activity Wildfire
Drought 0 0
Earthquake 0 0
Epidemic 0 0
Extreme temperature 0 0
Flood 0 0
Insect infestation 0 0
Landslide 0 0
Mass movement (dry) 0 0
Storm 0 0
Volcanic activity 0 0
Wildfire 0 0
Overall Statistics
Accuracy : 0.3705
95% CI : (0.3581, 0.383)
No Information Rate : 0.8449
P-Value [Acc > NIR] : 1
Kappa : 0.0344
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Drought Class: Earthquake Class: Epidemic
Sensitivity NA NA NA
Specificity 0.95472 0.9307 0.8811
Pos Pred Value NA NA NA
Neg Pred Value NA NA NA
Prevalence 0.00000 0.0000 0.0000
Detection Rate 0.00000 0.0000 0.0000
Detection Prevalence 0.04528 0.0693 0.1189
Balanced Accuracy NA NA NA
Class: Extreme temperature Class: Flood
Sensitivity NA 0.3768
Specificity 0.95575 0.7091
Pos Pred Value NA 0.8759
Neg Pred Value NA 0.1727
Prevalence 0.00000 0.8449
Detection Rate 0.00000 0.3184
Detection Prevalence 0.04425 0.3635
Balanced Accuracy NA 0.5429
Class: Insect infestation Class: Landslide
Sensitivity NA NA
Specificity 0.996226 0.95506
Pos Pred Value NA NA
Neg Pred Value NA NA
Prevalence 0.000000 0.00000
Detection Rate 0.000000 0.00000
Detection Prevalence 0.003774 0.04494
Balanced Accuracy NA NA
Class: Mass movement (dry) Class: Storm
Sensitivity NA 0.33628
Specificity 0.998285 0.74929
Pos Pred Value NA 0.19753
Neg Pred Value NA 0.86017
Prevalence 0.000000 0.15506
Detection Rate 0.000000 0.05214
Detection Prevalence 0.001715 0.26398
Balanced Accuracy NA 0.54279
Class: Volcanic activity Class: Wildfire
Sensitivity NA NA
Specificity 0.98456 0.97101
Pos Pred Value NA NA
Neg Pred Value NA NA
Prevalence 0.00000 0.00000
Detection Rate 0.00000 0.00000
Detection Prevalence 0.01544 0.02899
Balanced Accuracy NA NA
Confusion Matrix and Statistics
Reference
Prediction Earthquake Flood Wildfire
Earthquake 43 42 66
Flood 34 67 53
Wildfire 33 35 77
Overall Statistics
Accuracy : 0.4156
95% CI : (0.3696, 0.4626)
No Information Rate : 0.4356
P-Value [Acc > NIR] : 0.816683
Kappa : 0.1252
Mcnemar's Test P-Value : 0.001419
Statistics by Class:
Class: Earthquake Class: Flood Class: Wildfire
Sensitivity 0.39091 0.4653 0.3929
Specificity 0.68235 0.7157 0.7323
Pos Pred Value 0.28477 0.4351 0.5310
Neg Pred Value 0.77592 0.7399 0.6098
Prevalence 0.24444 0.3200 0.4356
Detection Rate 0.09556 0.1489 0.1711
Detection Prevalence 0.33556 0.3422 0.3222
Balanced Accuracy 0.53663 0.5905 0.5626
Again, when classifying all disaster types we only get observations classified as Flood and Storm, with an accuracy of 37.05%, an increase of less than 1% over the QDA model. It should also be mentioned that, because SVM's complexity grows with the number of categories of the response variable, the model takes quite a bit longer to train than the QDA model. However, the reduced dataset (500 observations each of Earthquake, Flood, and Wildfire) yields 41.56% accuracy, an increase of over 4% from the reduced QDA model, and does not take significantly longer to train.
A final attempted solution is to artificially balance the data. We try both upsampling the minority classes to the size of the majority class and mixed upsampling and downsampling so that all classes have 500 observations. We get the following results:
Confusion Matrix and Statistics
Reference
Prediction Drought Earthquake Epidemic Extreme temperature Flood
Drought 136 0 715 296 376
Earthquake 71 0 553 262 391
Epidemic 28 0 921 254 349
Extreme temperature 53 0 481 619 410
Flood 99 0 488 407 472
Insect infestation 0 0 410 0 212
Landslide 89 0 608 233 537
Mass movement (dry) 99 0 172 100 307
Storm 105 0 519 339 373
Volcanic activity 96 0 516 301 447
Wildfire 95 0 869 326 275
Reference
Prediction Insect infestation Landslide Mass movement (dry) Storm
Drought 85 39 343 0
Earthquake 196 31 420 0
Epidemic 111 82 222 0
Extreme temperature 139 18 234 0
Flood 133 58 269 0
Insect infestation 1131 0 312 0
Landslide 144 111 264 0
Mass movement (dry) 106 0 1359 0
Storm 165 78 413 0
Volcanic activity 169 75 444 0
Wildfire 97 42 199 0
Reference
Prediction Volcanic activity Wildfire
Drought 60 56
Earthquake 65 124
Epidemic 19 56
Extreme temperature 40 121
Flood 50 95
Insect infestation 0 0
Landslide 60 91
Mass movement (dry) 0 0
Storm 62 67
Volcanic activity 99 30
Wildfire 92 95
Overall Statistics
Accuracy : 0.2132
95% CI : (0.208, 0.2186)
No Information Rate : 0.2697
P-Value [Acc > NIR] : 1
Kappa : 0.1353
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Drought Class: Earthquake Class: Epidemic
Sensitivity 0.156142 NA 0.14731
Specificity 0.911695 0.90884 0.93378
Pos Pred Value 0.064577 NA 0.45103
Neg Pred Value 0.965123 NA 0.74780
Prevalence 0.037575 0.00000 0.26972
Detection Rate 0.005867 0.00000 0.03973
Detection Prevalence 0.090854 0.09116 0.08809
Balanced Accuracy 0.533919 NA 0.54055
Class: Extreme temperature Class: Flood
Sensitivity 0.19732 0.11376
Specificity 0.92536 0.91598
Pos Pred Value 0.29267 0.22791
Neg Pred Value 0.88047 0.82581
Prevalence 0.13533 0.17899
Detection Rate 0.02670 0.02036
Detection Prevalence 0.09124 0.08934
Balanced Accuracy 0.56134 0.51487
Class: Insect infestation Class: Landslide
Sensitivity 0.45679 0.207865
Specificity 0.95489 0.910536
Pos Pred Value 0.54770 0.051942
Neg Pred Value 0.93630 0.979898
Prevalence 0.10682 0.023037
Detection Rate 0.04879 0.004789
Detection Prevalence 0.08909 0.092192
Balanced Accuracy 0.70584 0.559201
Class: Mass movement (dry) Class: Storm
Sensitivity 0.30342 NA
Specificity 0.95808 0.9085
Pos Pred Value 0.63416 NA
Neg Pred Value 0.85169 NA
Prevalence 0.19323 0.0000
Detection Rate 0.05863 0.0000
Detection Prevalence 0.09245 0.0915
Balanced Accuracy 0.63075 NA
Class: Volcanic activity Class: Wildfire
Sensitivity 0.180987 0.129252
Specificity 0.908187 0.911116
Pos Pred Value 0.045475 0.045455
Neg Pred Value 0.978670 0.969654
Prevalence 0.023598 0.031708
Detection Rate 0.004271 0.004098
Detection Prevalence 0.093917 0.090164
Balanced Accuracy 0.544587 0.520184
Confusion Matrix and Statistics
Reference
Prediction Drought Earthquake Epidemic Extreme temperature Flood
Drought 15 7 37 17 22
Earthquake 9 5 36 19 21
Epidemic 0 1 70 15 24
Extreme temperature 1 5 37 30 35
Flood 12 2 39 26 26
Insect infestation 0 0 30 0 9
Landslide 4 0 39 18 36
Mass movement (dry) 7 14 15 11 15
Storm 5 7 39 23 21
Volcanic activity 8 5 34 7 32
Wildfire 4 2 63 10 19
Reference
Prediction Insect infestation Landslide Mass movement (dry) Storm
Drought 8 14 14 0
Earthquake 14 5 21 0
Epidemic 9 9 11 0
Extreme temperature 9 12 19 0
Flood 10 5 22 0
Insect infestation 74 10 28 0
Landslide 10 14 15 0
Mass movement (dry) 5 0 92 0
Storm 11 7 28 0
Volcanic activity 8 8 27 0
Wildfire 8 5 14 0
Reference
Prediction Volcanic activity Wildfire
Drought 5 2
Earthquake 11 8
Epidemic 7 5
Extreme temperature 2 10
Flood 4 3
Insect infestation 0 0
Landslide 8 7
Mass movement (dry) 0 0
Storm 6 7
Volcanic activity 9 3
Wildfire 10 10
Overall Statistics
Accuracy : 0.209
95% CI : (0.1896, 0.2294)
No Information Rate : 0.2659
P-Value [Acc > NIR] : 1
Kappa : 0.1287
Mcnemar's Test P-Value : <2e-16
Statistics by Class:
Class: Drought Class: Earthquake Class: Epidemic
Sensitivity 0.230769 0.104167 0.15945
Specificity 0.920555 0.910168 0.93317
Pos Pred Value 0.106383 0.033557 0.46358
Neg Pred Value 0.966887 0.971372 0.75400
Prevalence 0.039370 0.029073 0.26590
Detection Rate 0.009085 0.003028 0.04240
Detection Prevalence 0.085403 0.090248 0.09146
Balanced Accuracy 0.575662 0.507168 0.54631
Class: Extreme temperature Class: Flood
Sensitivity 0.17045 0.10000
Specificity 0.91186 0.91157
Pos Pred Value 0.18750 0.17450
Neg Pred Value 0.90208 0.84421
Prevalence 0.10660 0.15748
Detection Rate 0.01817 0.01575
Detection Prevalence 0.09691 0.09025
Balanced Accuracy 0.54116 0.50579
Class: Insect infestation Class: Landslide
Sensitivity 0.44578 0.15730
Specificity 0.94815 0.91229
Pos Pred Value 0.49007 0.09272
Neg Pred Value 0.93867 0.95000
Prevalence 0.10055 0.05391
Detection Rate 0.04482 0.00848
Detection Prevalence 0.09146 0.09146
Balanced Accuracy 0.69697 0.53480
Class: Mass movement (dry) Class: Storm
Sensitivity 0.31615 NA
Specificity 0.95074 0.90672
Pos Pred Value 0.57862 NA
Neg Pred Value 0.86662 NA
Prevalence 0.17626 0.00000
Detection Rate 0.05572 0.00000
Detection Prevalence 0.09631 0.09328
Balanced Accuracy 0.63344 NA
Class: Volcanic activity Class: Wildfire
Sensitivity 0.145161 0.181818
Specificity 0.916929 0.915414
Pos Pred Value 0.063830 0.068966
Neg Pred Value 0.964901 0.970120
Prevalence 0.037553 0.033313
Detection Rate 0.005451 0.006057
Detection Prevalence 0.085403 0.087826
Balanced Accuracy 0.531045 0.548616
We get accuracies of 21.32% and 20.9% for the SVM models using upsampled and mixed up/downsampled data, respectively. These are significant reductions from all of the previous methods. We still see two disaster types (Earthquake and Storm) never predicted by the upsampled model and one type (Storm) never predicted by the mixed up/downsampled model. The one benefit of these models appears to be that many of the disaster types have non-missing and reasonably high sensitivity and specificity values, which previous models that included all disaster types did not. This means that the up/downsampled models may do a better job of differentiating between the disaster types.
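The resampling step described above can be sketched with caret's upSample() and downSample(). The frame imbalanced_df is hypothetical, built only to show the mechanics of balancing class counts.

```r
library(caret)

# Illustrative sketch: balancing class counts with caret. imbalanced_df is
# hypothetical, with 300 Flood observations versus 30 Wildfire observations.
set.seed(7)
imbalanced_df <- data.frame(
  Type = factor(rep(c("Flood", "Wildfire"), times = c(300, 30))),
  LandAverageTemperature = rnorm(330, mean = 9.4, sd = 0.3),
  co2 = rnorm(330, mean = 28000, sd = 1500)
)

up <- upSample(x = imbalanced_df[, -1], y = imbalanced_df$Type, yname = "Type")
table(up$Type)    # minority class resampled up to the majority size (300)

down <- downSample(x = imbalanced_df[, -1], y = imbalanced_df$Type, yname = "Type")
table(down$Type)  # majority class sampled down to the minority size (30)
```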
In our analysis of the three continuous factors (number of disasters, year, and land average temperature), we found all three to be positively related. Since the 1900s, global disasters have increased exponentially while temperature increased mostly linearly, except between around 1945 and 1975. Perhaps this time period should be studied in more depth to understand what factors may have led temperatures to stop increasing. It is important to consider outliers and verify that data are complete for all of these factors. We found the spline method to be the best way to generate a model of the S-shaped regression curve of number of disasters over temperature.
We attempted three ways of modeling the time series of average land temperature: rough plus trend, first difference, and auto-generated seasonal. We got the best results with regard to AIC criterion from the rough plus trend model and the best looking forecast from the first difference model. It could be useful to re-attempt a seasonal model using a monthly level of granularity instead of yearly to capture actual variations due to the seasons (i.e. Fall, Winter, Spring, Summer).
Our categorization of disaster type based on land average temperature, \(CO_2\) emissions, and methane emissions proved difficult as there were problems of multicollinearity, non-normality of data, similarity of mean levels of predictors for each disaster type, and different numbers of each disaster type recorded. It would be helpful to look at more predictors that are less correlated with one another and for which the means for each disaster type differ more. These could include factors like geographical factors, wind speed, atmospheric measures, and sunlight.
The discoveries made in this report lead us to believe that more research must be done to discover which of these factors and other factors may be responsible for causing the increase in natural disasters, rather than just being correlated. It is our hope that the information in this report may provide a baseline for others to answer this question as well as figure out potential solutions to the problem.
https://www.kaggle.com/brsdincer/all-natural-disasters-19002021-eosdis - Natural disasters dataset
https://www.kaggle.com/amelinvladislav/map-of-temperatures-and-analysis-of-global-warming/data?select=GlobalTemperatures.csv - Temperature dataset
https://ourworldindata.org/co2-emissions (https://github.com/owid/co2-data) - Greenhouse gas emissions dataset
https://link.springer.com/article/10.1007/s10113-013-0499-2 - EURO-CORDEX projections study
https://www.liebertpub.com/doi/pdf/10.1089/big.2014.0026 - Using big data for climate change
https://stackoverflow.com/questions/49719660/r-markdown-to-pdf-printing-console-output - Used for console printing function
# Load libraries and data
library(datetime)
library(dplyr)
library(ggplot2)
library(sigmoid)
library(cowplot)
library(plotly)
library(forecast)
library(MASS)
library(caret)
library(nnet)
#library(survey)
library(effects)
library(tidyr)
library(MVN)
library(e1071)
library(imbalance)
knitr::opts_chunk$set(comment = NA)
disasters1.data <- read.csv('./data/DISASTERS/1900_2021_DISASTERS.xlsx - emdat data.csv')
disasters2.data <- read.csv('./data/DISASTERS/1970-2021_DISASTERS.xlsx - emdat data.csv')
temp.data <- read.csv('./data/GlobalTemperatures.csv')
emissions.data <- read.csv('data/owid-co2-data.txt', header = TRUE)
# Combine data from two disasters datasets
common <- match(colnames(disasters1.data), colnames(disasters2.data))
disasters2.data <- disasters2.data[,common]
disasters.data <- rbind(disasters1.data, disasters2.data)
# Histogram of all disasters per year
hist(disasters.data$Year, xlab = "Year", main = "Total Global Disasters per Year")
# Get data on CA fires
fire.data <- disasters.data[disasters.data$Disaster.Type == "Wildfire",]
row.condition <- Reduce(union, list(grep("cali", fire.data$Location, ignore.case = TRUE),
grep("los angeles", fire.data$Location, ignore.case = TRUE),
grep("Oakland", fire.data$Location),
grep("San Diego", fire.data$Location)))
cali.fires <- fire.data[row.condition,]
# Histogram of CA fires per year
hist(cali.fires$Year, xlab = "Year", main = "California Fires per Year")
# Get temperature data in desired format
temp.data.mod <- temp.data
temp.data.mod$dt <- as.date(temp.data.mod$dt, format = "%Y-%m-%d")
temp.data.mod <- temp.data.mod[temp.data.mod$dt > as.date("1899", format = "%Y"),]
temp.data.mod$dt <- as.numeric(format(temp.data.mod$dt, format = "%Y"))
temp.data.mod <- temp.data.mod %>% group_by(dt) %>% summarize(LandAverageTemperature = mean(LandAverageTemperature),
LandAverageTemperatureUncertainty = mean(LandAverageTemperatureUncertainty),
LandAndOceanAverageTemperature = mean(LandAndOceanAverageTemperature),
LandAndOceanAverageTemperatureUncertainty = mean(LandAndOceanAverageTemperatureUncertainty))
# Plot of average land temperature over time
ggplot(data = temp.data.mod, mapping = aes(x = dt, y = LandAverageTemperature)) +
ylim(min(temp.data.mod$LandAverageTemperature), max(temp.data.mod$LandAverageTemperature)) +
labs(title = "Land Average Temperature by Year", x = "Year", y = "Temperature") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point() + geom_smooth()
# Get number of disasters per year up to 2016
disasters.data.mod <- disasters.data
disasters.data.mod <- disasters.data.mod[disasters.data.mod$Year < 2016,]
disasters.data.mod <- disasters.data.mod %>% group_by(Year) %>% summarize(n = n())
# Join temperature and disaster datasets
temp.and.disasters <- dplyr::inner_join(temp.data.mod, disasters.data.mod, by = c("dt" = "Year"))
# Plot of number of disasters against average land temperature
ggplot(data = temp.and.disasters, mapping = aes(x = LandAverageTemperature, y = n)) +
ylim(min(temp.and.disasters$n), max(temp.and.disasters$n)) +
labs(title = "Number of Disasters for each Temperature", x = "Temperature", y = "Disasters") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point() + geom_smooth()
# Linear model with response number of disasters and predictor average land temperature using sigmoid
num.disasters.v.temp <- lm(n ~ sigmoid::sigmoid(LandAverageTemperature - 50, "logistic"), data = temp.and.disasters)
p1 <- ggplot(data = temp.and.disasters) +
labs(title = "Land Average Temperature by Year (Fitted)", x = "Temperature", y = "Disasters") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(mapping = aes(x = LandAverageTemperature, y = n)) +
geom_line(mapping = aes(x = LandAverageTemperature, y = num.disasters.v.temp$fitted.values), size = 1, color = 'blue')
# Fit using splines
spline.model <- smooth.spline(x = temp.and.disasters$LandAverageTemperature, y = temp.and.disasters$n, nknots = 6)
p2 <- ggplot(data = temp.and.disasters) +
labs(x = "Temperature", y = "Disasters") +
geom_point(mapping = aes(x = LandAverageTemperature, y = n)) +
geom_line(mapping = aes(x = spline.model$x, y = spline.model$y), size = 1, color = 'blue')
plot_grid(p1, p2, ncol = 1)
## Time series projections ##
# Fit a trend model of land temperatures against time
temp.trend <- smooth.spline(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, nknots = 5)
# Plot ACF and PACF for raw data, first difference of data, and data with trend removed
orig.par <- par(no.readonly = TRUE)  # exclude read-only settings so par(orig.par) restores cleanly
par(mfrow = c(3,2), mar = c(3,4,3,2))
acf(temp.and.disasters$LandAverageTemperature, main = "ACF (raw data)")
pacf(temp.and.disasters$LandAverageTemperature, main = "PACF (raw data)")
acf((temp.and.disasters$LandAverageTemperature - temp.trend$y), main = "ACF (rough part)")
pacf((temp.and.disasters$LandAverageTemperature - temp.trend$y), main = "PACF (rough part)")
acf(diff(temp.and.disasters$LandAverageTemperature, lag = 1), main = "ACF (first difference)")
pacf(diff(temp.and.disasters$LandAverageTemperature, lag = 1), main = "PACF (first difference)")
par(orig.par)
# Generate rough, first difference, and seasonal models
temp.ts.rough <- arima((temp.and.disasters$LandAverageTemperature - temp.trend$y), order = c(0,0,3))
temp.ts.diff <- arima(temp.and.disasters$LandAverageTemperature, order = c(3,1,0))
temp.ts.seas <- auto.arima(temp.and.disasters$LandAverageTemperature, seasonal = TRUE)
# Generate forecasts for the next 10 years from each model
temp.fc.rough <- forecast(temp.ts.rough, h = 10)
temp.fc.diff <- forecast(temp.ts.diff, h = 10)
temp.fc.seas <- forecast(temp.ts.seas, h = 10)
temp.fc.rpt <- temp.fc.rough$mean + predict(temp.trend$fit, 2016:2025)$y
# Plot the fitted models and forecasts
par(mfrow = c(3,2), mar = c(3,4,3,2))
plot(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, main = "Rough Plus Trend (Fitted)", xlab = "Time", ylab = "Temperature", type = "n")
lines(temp.and.disasters$dt, temp.and.disasters$LandAverageTemperature, col = "black")
lines(temp.and.disasters$dt, fitted(temp.ts.rough) + temp.trend$y, col = "blue", lwd = 2)
plot.ts(temp.fc.rpt, main = "Rough Plus Trend (Forecasts)", xlab = "Time", ylab = "Temperature")
plot(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, main = "First Difference (Fitted)", xlab = "Time", ylab = "Temperature", type = "n")
lines(temp.and.disasters$dt, temp.and.disasters$LandAverageTemperature, col = "black")
lines(temp.and.disasters$dt, fitted(temp.ts.diff), col = "blue", lwd = 2)
plot.ts(temp.fc.diff$mean, main = "First Difference (Forecasts)", xlab = "Time", ylab = "Temperature")
plot(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, main = "Seasonal Model (Fitted)", xlab = "Time", ylab = "Temperature", type = "n")
lines(temp.and.disasters$dt, temp.and.disasters$LandAverageTemperature, col = "black")
lines(temp.and.disasters$dt, fitted(temp.ts.seas), col = "blue", lwd = 2)
plot.ts(temp.fc.seas$mean, main = "Seasonal Model (Forecasts)", xlab = "Time", ylab = "Temperature")
par(orig.par)
# Plot forecast and historical data (last 10 years) from assumed best models
combined.rough.ts <- ts(c(temp.trend$y[107:116]+fitted(temp.ts.rough)[107:116], temp.fc.rpt))
p1 <- ggplot() +
labs(title = "Rough Plus Trend (Historical and Forecast)", x = "Year", y = "Temperature") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_line(mapping = aes(x = 2006:2025, y = combined.rough.ts)) +
geom_vline(xintercept = 2015.5, color = 'red')
combined.diff.ts <- ts(c(fitted(temp.ts.diff)[107:116], temp.fc.diff$mean))
p2 <- ggplot() +
labs(title = "First Difference (Historical and Forecast)", x = "Year", y = "Temperature") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_line(mapping = aes(x = 2006:2025, y = combined.diff.ts)) +
geom_vline(xintercept = 2015.5, color = 'red')
plot_grid(p1, p2, ncol = 1)
## Predictors of types of natural disasters ##
# Modify emissions data to only include global data and select needed variables
emissions.data.mod <- emissions.data %>% dplyr::select(country, year, co2, methane) %>% filter(country == 'World')
# Plots of emissions/temperatures over time
title <- ggdraw() + draw_label("Temperature and Emissions Trends", fontface='bold')
p1 <- ggplot(data = emissions.data.mod[!is.na(emissions.data.mod$co2),]) +
labs(x = "Year", y = "Emissions (million tonnes CO2)") +
geom_smooth(mapping = aes(x = year, y = co2))
p2 <- ggplot(data = emissions.data.mod[!is.na(emissions.data.mod$methane),]) +
labs(x = "Year", y = "Emissions (million tonnes methane)") +
geom_smooth(mapping = aes(x = year, y = methane))
p3 <- ggplot(data = temp.data.mod) +
labs(x = "Year", y = "Land Temperature") +
geom_smooth(mapping = aes(x = dt, y = LandAverageTemperature))
p4 <- ggplot(data = temp.data.mod) +
labs(x = "Year", y = "Land and Ocean Temp") +
geom_smooth(mapping = aes(x = dt, y = LandAndOceanAverageTemperature))
p <- plot_grid(p1, p2, p3, p4, ncol = 2)
plot_grid(title, p, ncol = 1, rel_heights=c(0.4, 1))
# Combine natural disasters, temperature, and emissions datasets for years in which there is
# data for all three of them (1990-2016)
nat.disasters <- data.frame(Year = disasters.data$Year, Type = disasters.data$Disaster.Type) %>% filter(Year >= 1990 & Year <= 2016)
nat.disasters <- nat.disasters %>% left_join(temp.data.mod, by = c("Year" = "dt"), copy = TRUE) %>% left_join(emissions.data.mod, by = c("Year" = "year"), copy = TRUE)
# ANOVA on temperature, CO2, and methane
aov.temp <- aov(LandAverageTemperature ~ Type, data = nat.disasters, na.action = na.exclude)
tukey.temp <- TukeyHSD(aov.temp)
aov.co2 <- aov(co2 ~ Type, data = nat.disasters)
aov.methane <- aov(methane ~ Type, data = nat.disasters)
# Classification of disaster type based on temperature and CO2
pairs(nat.disasters[,c("LandAverageTemperature", "LandAndOceanAverageTemperature", "co2", "methane")])
# custom grid style
axx <- list(
gridcolor='rgb(255, 255, 255)',
zerolinecolor='rgb(255, 255, 255)',
showbackground=TRUE,
backgroundcolor='rgb(230, 230,230)'
)
create_plotly_fig <- function(data, cat, plotnum) {
cat_data = data %>% filter(type == cat)
cat_probs = matrix(cat_data$prob, nrow = 50, ncol = 50)
fig = plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~cat_probs, scene = paste("scene", plotnum, sep="")) %>% add_surface(showscale = FALSE)
return(fig)
}
create_scene <- function(cat, rid, cid, xcoord, ycoord, scene) {
return(list(domain=list(x=xcoord, y=ycoord, row=rid, column=cid),
xaxis=c(axx, id=scene),
yaxis=c(axx, id=scene),
zaxis=c(axx, title=paste(cat, "Probability"), id=scene),
aspectmode='cube'))
}
# TODO: filter out only Climatological subgroup
model_3d = multinom(as.factor(Type) ~ LandAverageTemperature + co2, data = nat.disasters, trace = FALSE)
mintemp = min(nat.disasters$LandAverageTemperature, na.rm = TRUE)
maxtemp = max(nat.disasters$LandAverageTemperature, na.rm = TRUE)
minco2 = min(nat.disasters$co2, na.rm = TRUE)
maxco2 = max(nat.disasters$co2, na.rm = TRUE)
xygrid = expand.grid(LandAverageTemperature = seq(mintemp, maxtemp, len=50), co2 = seq(minco2, maxco2, len=50))
z = as.data.frame(predict(model_3d, newdata = xygrid, type="probs"))
z = pivot_longer(z, 1:ncol(z), names_to = "type", values_to = "prob")
# Individual probability surfaces, built with the helper defined above
fig1 <- create_plotly_fig(z, "Storm", 1)
fig2 <- create_plotly_fig(z, "Flood", 2)
fig3 <- create_plotly_fig(z, "Wildfire", 3)
fig4 <- create_plotly_fig(z, "Earthquake", 4)
# subplot and define scene
fig <- subplot(fig1, fig2, fig3, fig4)
fig <- fig %>% layout(title = "Regressing Disasters on Temp and CO2",
scene = list(domain=list(x=c(0,0.5),y=c(0.5,1)),
xaxis=c(axx, title="Temp"),
yaxis=c(axx, title="CO2"),
zaxis=c(axx, title="Storm Probability"),
aspectmode='cube'),
scene2 = list(domain=list(x=c(0.5,1),y=c(0.5,1)),
xaxis=c(axx, title="Temp"),
yaxis=c(axx, title="CO2"),
zaxis=c(axx, title="Flood Probability"),
aspectmode='cube'),
scene3 = list(domain=list(x=c(0,0.5),y=c(0,0.5)),
xaxis=c(axx, title="Temp"),
yaxis=c(axx, title="CO2"),
zaxis=c(axx, title="Wildfire Probability"),
aspectmode='cube'),
scene4 = list(domain=list(x=c(0.5,1),y=c(0,0.5)),
xaxis=c(axx, title="Temp"),
yaxis=c(axx, title="CO2"),
zaxis=c(axx, title="Earthquake Probability"),
aspectmode='cube'))
fig
#plot_ly(data = nat.disasters, x = nat.disasters$LandAverageTemperature, y = nat.disasters$co2, z = nat.disasters$methane, type = "scatter3d") %>% add_surface(z = as.matrix(nat.disasters$methane))
ggplot(data = nat.disasters) +
labs(title = "Temperature and CO2 Emissions for Disaster Type", x = "Disaster Type", y = "Million Tonnes CO2") +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 9)) +
geom_point(mapping = aes(x = Type, y = co2, colour = LandAverageTemperature), size = 5) +
scale_color_gradient(low = "yellow", high = "red")
fig1 <- create_plotly_fig(z, "Animal accident", 1)
fig2 <- create_plotly_fig(z, "Drought", 2)
fig3 <- create_plotly_fig(z, "Earthquake", 3)
fig4 <- create_plotly_fig(z, "Epidemic", 4)
fig5 <- create_plotly_fig(z, "Extreme temperature ", 5)
fig6 <- create_plotly_fig(z, "Flood", 6)
fig7 <- create_plotly_fig(z, "Impact", 7)
fig8 <- create_plotly_fig(z, "Insect infestation", 8)
fig9 <- create_plotly_fig(z, "Landslide", 9)
fig10 <- create_plotly_fig(z, "Mass movement (dry)", 10)
fig11 <- create_plotly_fig(z, "Storm", 11)
fig12 <- create_plotly_fig(z, "Volcanic activity", 12)
fig13 <- create_plotly_fig(z, "Wildfire", 13)
# subplot and define scene
fig <- subplot(fig1, fig2, fig3, fig4, fig5, fig6, fig7, fig8, fig9, fig10, fig11, fig12, fig13, nrows = 4)
# Scene domains must lie within [0,1] on both axes: four columns of width
# 0.25 and four rows of height 0.25, with the top row listed first
fig <- fig %>% layout(title = "Regressing Disasters on Temp and CO2",
scene = create_scene("Animal accident", 0, 0, c(0,0.25), c(0.75,1), 1),
scene2 = create_scene("Drought", 0, 1, c(0.25,0.5), c(0.75,1), 2),
scene3 = create_scene("Earthquake", 0, 2, c(0.5,0.75), c(0.75,1), 3),
scene4 = create_scene("Epidemic", 0, 3, c(0.75,1), c(0.75,1), 4),
scene5 = create_scene("Extreme temperature", 1, 0, c(0,0.25), c(0.5,0.75), 5),
scene6 = create_scene("Flood", 1, 1, c(0.25,0.5), c(0.5,0.75), 6),
scene7 = create_scene("Impact", 1, 2, c(0.5,0.75), c(0.5,0.75), 7),
scene8 = create_scene("Insect infestation", 1, 3, c(0.75,1), c(0.5,0.75), 8),
scene9 = create_scene("Landslide", 2, 0, c(0,0.25), c(0.25,0.5), 9),
scene10 = create_scene("Mass movement (dry)", 2, 1, c(0.25,0.5), c(0.25,0.5), 10),
scene11 = create_scene("Storm", 2, 2, c(0.5,0.75), c(0.25,0.5), 11),
scene12 = create_scene("Volcanic activity", 2, 3, c(0.75,1), c(0.25,0.5), 12),
scene13 = create_scene("Wildfire", 3, 0, c(0.375,0.625), c(0,0.25), 13)
)
fig
# Boxplots of temperatures and CO2 levels at which disasters occur
p1 <- ggplot(data = nat.disasters) +
geom_boxplot(mapping = aes(x = Type, y = LandAverageTemperature)) +
theme(axis.text.x = element_text(size = 5))
p2 <- ggplot(data = nat.disasters) +
geom_boxplot(mapping = aes(x = Type, y = co2)) +
theme(axis.text.x = element_text(size = 5))
plot_grid(p1, p2, ncol = 1)
# Obtained from https://stackoverflow.com/questions/49719660/r-markdown-to-pdf-printing-console-output
print_output <- function(output, cex = 0.5) {
tmp <- capture.output(output)
plot.new()
text(0, 1, paste(tmp, collapse='\n'), adj = c(0,1), family = 'mono', cex = cex)
box()
}
# CO2 and methane appear to be multicollinear and LandAverageTemperature and CO2 are both NOT normally distributed. Therefore, use qda.
nat.disasters.mod <- nat.disasters %>% filter(!(Type %in% c("Animal accident", "Impact"))) %>% dplyr::select(LandAverageTemperature, co2, Type) %>% drop_na()
mvns = c()
for (t in unique(nat.disasters.mod$Type)) {
d <- nat.disasters.mod %>% filter(Type == t) %>% dplyr::select(LandAverageTemperature, co2)
res <- mvn(data = d, mvnTest = "hz")
mvns <- append(mvns, res$multivariateNormality["MVN"])
}
names(mvns) <- unique(nat.disasters.mod$Type)
set.seed(123)
train.ind <- sample(1:nrow(nat.disasters.mod), size = 0.7 * nrow(nat.disasters.mod), replace = FALSE)
train <- nat.disasters.mod[train.ind,]
test <- nat.disasters.mod[-train.ind,]
temp.co2.qda <- qda(Type ~ LandAverageTemperature + co2, data = train)
#print_output(temp.co2.qda)
temp.co2.qda
pred.qda <- predict(temp.co2.qda, test)
# confusionMatrix takes predictions first, then the reference (true) labels
conf.qda <- confusionMatrix(pred.qda$class, factor(test$Type, levels = levels(pred.qda$class)))
#print_output(conf.qda, cex = 0.4)
conf.qda
set.seed(456)
nat.disasters.mod2 <- nat.disasters.mod %>% filter((Type %in% c("Flood", "Wildfire", "Earthquake"))) %>% dplyr::select(LandAverageTemperature, co2, Type) %>% group_by(Type) %>% drop_na() %>% slice_sample(n = 500)
train.ind.red <- sample(1:nrow(nat.disasters.mod2), size = 0.7 * nrow(nat.disasters.mod2), replace = FALSE)
train.red <- nat.disasters.mod2[train.ind.red,]
test.red <- nat.disasters.mod2[-train.ind.red,]
temp.co2.qda <- qda(Type ~ LandAverageTemperature + co2, data = train.red)
pred.qda <- predict(temp.co2.qda, test.red)
conf.qda <- confusionMatrix(pred.qda$class, factor(test.red$Type, levels = levels(pred.qda$class)))
#print_output(conf.qda, cex = 0.4)
conf.qda
temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train)
pred.svm <- predict(temp.co2.svm, test)
conf.svm <- confusionMatrix(pred.svm, factor(test$Type, levels = levels(pred.svm)))
#print_output(conf.svm, cex = 0.4)
conf.svm
temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train.red)
pred.svm <- predict(temp.co2.svm, test.red)
conf.svm <- confusionMatrix(pred.svm, factor(test.red$Type, levels = levels(pred.svm)))
#print_output(conf.svm, cex = 0.4)
conf.svm
set.seed(789)
nat.disasters.mod3 <- nat.disasters.mod %>% dplyr::select(LandAverageTemperature, co2, Type) %>% drop_na()
## Up Sampling
balanced.disasters.up <- nat.disasters.mod3
max_num <- max(count(nat.disasters.mod3, Type)$n)
for (t in unique(nat.disasters.mod3$Type)) {
  n.extra <- max_num - sum(nat.disasters.mod3$Type == t)  # avoid shadowing base::diff
  if (n.extra > 0) {
    balanced.disasters.up <- rbind(balanced.disasters.up,
      slice_sample(filter(nat.disasters.mod3, Type == t), n = n.extra, replace = TRUE))
  }
}
train.ind.up <- sample(1:nrow(balanced.disasters.up), size = 0.7 * nrow(balanced.disasters.up), replace = FALSE)
train.up <- balanced.disasters.up[train.ind.up,]
test.up <- balanced.disasters.up[-train.ind.up,]
temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train.up)
pred.svm <- predict(temp.co2.svm, test.up)
conf.svm <- confusionMatrix(pred.svm, factor(test.up$Type, levels = levels(pred.svm)))
conf.svm
## Down Sampling (classes above 500 rows are down-sampled without replacement;
## smaller classes are up-sampled with replacement, so each class ends at 500)
balanced.disasters.down <- data.frame()
for (t in unique(nat.disasters.mod3$Type)) {
  n.t <- sum(nat.disasters.mod3$Type == t)
  balanced.disasters.down <- rbind(balanced.disasters.down,
    slice_sample(filter(nat.disasters.mod3, Type == t), n = 500, replace = (n.t < 500)))
}
train.ind.down <- sample(1:nrow(balanced.disasters.down), size = 0.7 * nrow(balanced.disasters.down), replace = FALSE)
train.down <- balanced.disasters.down[train.ind.down,]
test.down <- balanced.disasters.down[-train.ind.down,]
temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train.down)
pred.svm <- predict(temp.co2.svm, test.down)
conf.svm <- confusionMatrix(pred.svm, factor(test.down$Type, levels = levels(pred.svm)))
conf.svm
#print_output(conf.svm, cex = 0.4)